scGate: Purification of cell types from complex single-cell RNA-seq data. Basic examples.

Setup environment

renv::activate()
renv::restore()
library(ggplot2)
library(dplyr)
library(patchwork)
remotes::install_github("carmonalab/scGate",ref = "dev")
library(scGate)

Load demo datasets

testing.datasets <- scGate::get_testing_data(version = "hsa.latest")
dir.create("./plots")

Let’s play with a PBMC dataset (“Satija”) from Hao et al 2021 (https://doi.org/10.1016/j.cell.2021.04.048)

obj <- testing.datasets[["Satija"]]
DimPlot(obj, label = T, repel = T, group.by = "celltype.l1") + theme(legend.position = "none",
    aspect.ratio = 1)

ggsave("plots/umap.satija.celltype.3.pdf", width = 3, height = 3)

Creating a simple gating model

Now let’s setup a simple scGate gating model to purify a population of interest, for instance, B cells. B cells are usually identified by the expression of CD20, encoded by MS4A1

# create a simple gating model for scGate
my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
my_scGate_model
  levels   use_as  name signature
1 level1 positive Bcell     MS4A1

Purify a cell population of interest with scGate

# run scGate
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

ggsave("plots/umap.satija.Bcell.pdf", width = 4, height = 4)

Because in this demo example we know the ground truth (at least, as defined by the authors), let’s evaluate scGate predictive performance in terms of precision/positive predictive value, recall/sensitivity, and accuracy using Matthews Correlation Coefficient (MCC)

scGate::performance.metrics(grepl("^B ", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9967742 0.9967742 0.9961825 

With this very simple model, >99% (Recall) are isolated with a >99% purity (Precision), and and overall accuracy/MCC >99%

In another example, let’s gate plasmacytoid dendritic cells (pDCs), which can be isolated using marker LILRA4 (eg https://pubmed.ncbi.nlm.nih.gov/30395816/)

my_scGate_model <- gating_model(name = "pDC", signature = c("LILRA4"))  # add one positive signature
my_scGate_model
  levels   use_as name signature
1 level1 positive  pDC    LILRA4
# run the model
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

scGate::performance.metrics(obj$cell_type == "pDC", obj$is.pure == "Pure")
PREC  REC  MCC 
   1    1    1 

Gating models with positive and negative markers

Another example for gating NKs, which express KLRD1. However, KLRD1 can also be expreseed by some T cell subsets, so we can filter these out by including “negative” T cell markers such as CD3D

my_scGate_model <- gating_model(name = "NK", signature = c("KLRD1", "CD3D-"))

# run the model
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

ggsave("plots/umap.satija.NK.pdf", width = 4, height = 4)

Let’s evaluate performance

scGate::performance.metrics(grepl("^NK", obj$cell_type), obj$is.pure == "Pure")
     PREC       REC       MCC 
1.0000000 0.8537736 0.9160913 

With this simple model, >97% of NKs (recall) are isolated with a 96% purity (Precision), and overall accuracy/MCC of 96%

Hierarchical gating models

Here we have been using a dataset derived from blood (PBMC). When working with more complex tissues, such as tumors, we might need to apply more stringent criteria. That’s why scGate allows to purify by steps, using a hierarchical gating model.

Let’s explore the whole tumor dataset by Jerby-Arnon et al 2018 (https://doi.org/10.1016/j.cell.2018.09.006)

obj <- testing.datasets[["JerbyArnon"]]
DimPlot(obj, label = T, repel = T, group.by = "cell_type") + theme(legend.position = "none",
    aspect.ratio = 1)

ggsave("plots/umap.jerby.celltype.3.pdf", width = 3, height = 3)

Here we have non-immune populations such as malignant/cancer cells (Mal), Endothelial cells (Endo) and cancer-associated fibroblasts (CAF). We can try to purify a broadly defined cell population, for instance, all immune cells, using pan-immune cell marker CD45 encoded by PTPRC:

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"))
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

ggsave("plots/umap.jerby.immune.pdf", width = 4, height = 4)
scGate::performance.metrics(obj$cell_type %in% c("B.cell", "NK", "T.CD4", "T.CD8",
    "T.cell", "Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.8999310 0.9775112 0.8014946 

Now let’s try to recover macrophages (eg using common markers CD68 and FCGR1A). Instead of purifying macrophages directly from the whole tissue, we can try to do it in two steps (hierarchical strategy): i) isolate immune cells as in the previous example, ii) isolate macrophages from immune cells. Hierarchical gating models can be specified in scGate using parameter “level”, as follows:

my_scGate_model <- gating_model(name = "immune", signature = c("PTPRC"), level = 1)  # initialize model with one positive signature
my_scGate_model <- gating_model(model = my_scGate_model, name = "macrophage", signature = c("CD68",
    "FCGR1A"), level = 2)  # add positive signature at second step
obj <- scGate(data = obj, model = my_scGate_model, assay = DefaultAssay(obj))
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

ggsave("plots/umap.jerby.macro.pdf", width = 4, height = 4)
scGate::performance.metrics(obj$cell_type %in% c("Macrophage"), obj$is.pure == "Pure")
     PREC       REC       MCC 
0.9068826 0.9218107 0.9023599 

Here we isolated 92% of macrophages with a 91% purity (according to annotation by Jerby-Arnon et al). We could easily improve PREC and REC by adding more positive and negative markers, respectively (e.g. removing lymphocytes)

We can see how the two-level gating model worked in each step using scGate’s plot_levels. We see on the left the immune cell purification step, and on the right, the macrophage purifications step.

wrap_plots(plot_levels(obj))

Editing models

scGate models of arbitrary complexity can be easily constructed to achieve high purification performance. These gating models can be written using the gating_model() function, and manually edited using eg R’s fix(). You can also export your basic model as tabulated text and edit it manually on Excel

fix(my_scGate_model)  # edit locally in R
write.table(my_scGate_model, "test.tsv", sep = "\t")  # export and then edit on Excel
my_scGate_model_edited <- scGate::load_scGate_model("test.tsv")  # tabulated-text model can loaded

Fast evaluation of model performance

We provide a built-in function test_my_model to automatically evaluate the performance of a gating model using 3 pre-annotated testing datasets:

my_scGate_model <- gating_model(name = "Bcell", signature = c("MS4A1"))
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.8550725 0.9711934 0.8984003
Zilionis   0.9572193 0.9521277 0.9499812
Satija     0.9967742 0.9967742 0.9961825

We see that in one of the datasets our simple model did not have an excellent precision (85%). We can refine it, for instance, by removing potentially contaminating T cell genes (CD3D) and try again:

my_scGate_model <- gating_model(model = my_scGate_model, name = "Tcell", signature = c("CD3D"),
    negative = T)
panBcell.performance <- test_my_model(my_scGate_model, target = "Bcell")

panBcell.performance$performance
                PREC       REC       MCC
JerbyArnon 0.9446809 0.9135802 0.9193858
Zilionis   0.9572193 0.9521277 0.9499812
Satija     0.9967638 0.9935484 0.9942680

Now performance is very high for the three datasets.

Using pre-defined gating models

We also provide some pre-defined gating models, these can be retrieved with get_scGateDB()

Load default models from DB

models.DB <- scGate::get_scGateDB(force_update = T)

names(models.DB$human$generic)
 [1] "Bcell"           "CD4T"            "CD8T"            "MoMacDC"        
 [5] "Myeloid"         "NK"              "PanBcell"        "Plasma_cell"    
 [9] "Tcell"           "Tcell.alphabeta"

names(models.DB$mouse$generic)
[1] "CD4T"            "CD8T"            "MoMacDC"         "Myeloid"        
[5] "Tcell"           "Tcell.alphabeta"

Visualizing gating models

To visualize complex models in tree-like structures, we provide plot_tree() function This is a model to purify plasma (B) cells with high accuracy

my_scGate_model <- models.DB$human$generic$Plasma_cell
plt.tree <- scGate::plot_tree(my_scGate_model)
plt.tree

Run an example, now on another tumor dataset by Zilionis et al 2019 (https://doi.org/10.1016/j.immuni.2019.03.009)

obj <- testing.datasets[["Zilionis"]]
DimPlot(obj, label = T, repel = T, group.by = "cell_type") + theme(legend.position = "none",
    aspect.ratio = 1)

# run scGate
obj <- scGate(obj, model = my_scGate_model)
DimPlot(obj, cols = c(list(Impure = "gray", Pure = "green"))) + theme(aspect.ratio = 1)

Visualizing filtering results by level. We can see how the purity increases at each level.

plots <- scGate::plot_levels(obj)
plots[["annotation"]] <- DimPlot(obj, group.by = "cell_type", label = T, repel = T,
    label.size = 3) + ggtitle("Author's manual annot") + NoLegend()
wrap_plots(plots, ncol = 2)

Evaluate performance of filtering compared to original annotation

perf.scGate = scGate::performance.metrics(actual = obj$Plasma_cell, pred = obj$is.pure ==
    "Pure")
perf.scGate
     PREC       REC       MCC 
0.9255319 0.9456522 0.9323972